abstract def close(errorOrNull: Throwable): Unit

Called when stopping to process one partition of new data on the executor side. This is
guaranteed to be called whether `open` returns `true` or `false`. However, `close` won't be
called in the following cases:

- the JVM crashes without throwing a `Throwable`
- `open` throws a `Throwable`.

errorOrNull: the error thrown during processing of the data, or null if there was no error.
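As an illustration only, the sketch below buffers rows while the partition is being processed and uses `errorOrNull` in `close` to decide whether to keep the output. The class name, the local file path, and the buffer-then-commit strategy are assumptions made for this sketch, not part of the API.

    import java.nio.file.{Files, Path, Paths}
    import scala.collection.mutable.ArrayBuffer

    import org.apache.spark.sql.ForeachWriter

    class BufferAndCommitWriter extends ForeachWriter[String] {
      private var buffer: ArrayBuffer[String] = _
      private var outFile: Path = _

      override def open(partitionId: Long, epochId: Long): Boolean = {
        buffer = ArrayBuffer.empty[String]
        // Hypothetical per-partition, per-epoch output file.
        outFile = Paths.get(s"/tmp/foreach-sink/part-$partitionId-$epochId.txt")
        true
      }

      override def process(value: String): Unit = buffer += value

      // close() runs whether open() returned true or false, but not if open() threw or the
      // JVM crashed; errorOrNull reports any failure seen while processing rows.
      override def close(errorOrNull: Throwable): Unit = {
        if (errorOrNull == null) {
          Files.createDirectories(outFile.getParent)
          Files.write(outFile, buffer.mkString("\n").getBytes("UTF-8"))
        } else {
          buffer.clear() // drop partial output; the engine's retry will resend this epoch's data
        }
      }
    }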
abstract def open(partitionId: Long, epochId: Long): Boolean

Called when starting to process one partition of new data in the executor. See the class
docs for more information on how to use the `partitionId` and `epochId`.

partitionId: the partition id.
epochId: a unique id for data deduplication.
returns: `true` if the corresponding partition and epoch should be processed; `false`
indicates the partition should be skipped.
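As a hedged sketch of how `epochId` can support deduplication, the writer below records a commit marker per (partitionId, epochId) and returns `false` from `open` when that marker already exists, so reprocessed data is skipped. The class name and the marker directory are assumptions made for this example; a real sink would consult its own transactional metadata.

    import java.nio.file.{Files, Path, Paths}

    import org.apache.spark.sql.ForeachWriter

    class IdempotentWriter extends ForeachWriter[String] {
      private val markerDir = "/tmp/foreach-sink/commits" // hypothetical commit-marker location
      private var marker: Path = _
      private var firstAttempt = false

      override def open(partitionId: Long, epochId: Long): Boolean = {
        marker = Paths.get(markerDir, s"$partitionId-$epochId")
        firstAttempt = !Files.exists(marker)
        // Returning false tells Spark to skip every row of this partition and epoch.
        firstAttempt
      }

      override def process(value: String): Unit = {
        // write `value` to the external sink here
      }

      override def close(errorOrNull: Throwable): Unit = {
        // close() also runs when open() returned false, so only commit freshly processed work.
        if (firstAttempt && errorOrNull == null) {
          Files.createDirectories(Paths.get(markerDir))
          Files.createFile(marker) // record that this (partitionId, epochId) has been written
        }
      }
    }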
abstract def process(value: T): Unit

Called to process the data on the executor side. This method will be called only if `open`
returns `true`.
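A small sketch of the usual pattern around `process`: a resource is created in `open`, each row is appended in `process` (which only runs because `open` returned `true`), and the resource is released in `close`. The class name and the local output directory are assumptions made for this example.

    import java.io.{BufferedWriter, File, FileWriter}

    import org.apache.spark.sql.ForeachWriter

    class LocalFileWriter extends ForeachWriter[String] {
      @transient private var out: BufferedWriter = _

      override def open(partitionId: Long, epochId: Long): Boolean = {
        val dir = new File("/tmp/foreach-sink/raw") // hypothetical output directory
        dir.mkdirs()
        out = new BufferedWriter(new FileWriter(new File(dir, s"part-$partitionId-$epochId.txt")))
        true
      }

      // Invoked once per row, and only because open() returned true above.
      override def process(value: String): Unit = {
        out.write(value)
        out.newLine()
      }

      override def close(errorOrNull: Throwable): Unit = {
        if (out != null) out.close()
      }
    }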
The abstract class for writing custom logic to process data generated by a query. This is often used to write the output of a streaming query to arbitrary storage systems. Any implementation of this base class will be used by Spark in the following way.
- A single instance of this class is responsible for all the data generated by a single task
  in a query. In other words, one instance is responsible for processing one partition of the
  data generated in a distributed manner.

- Any implementation of this class must be serializable, because each task will get a fresh
  serialized-deserialized copy of the provided object. Hence, it is strongly recommended that
  any initialization for writing data (e.g. opening a connection or starting a transaction)
  is done after the `open(...)` method has been called, which signifies that the task is ready
  to generate data.

- The lifecycle of the methods is as follows.

  For each partition with `partitionId`:
      For each batch/epoch of streaming data (if it is a streaming query) with `epochId`:
          Method `open(partitionId, epochId)` is called.
          If `open` returns true:
              For each row in the partition and batch/epoch, method `process(row)` is called.
          Method `close(errorOrNull)` is called with the error (if any) seen while processing rows.

Important points to note:

- The `partitionId` and `epochId` can be used to deduplicate generated data when failures
  cause reprocessing of some input data. This depends on the execution mode of the query. If
  the streaming query is being executed in the micro-batch mode, then every partition
  represented by a unique tuple (partitionId, epochId) is guaranteed to have the same data.
  Hence, (partitionId, epochId) can be used to deduplicate and/or transactionally commit data
  and achieve exactly-once guarantees. However, if the streaming query is being executed in
  the continuous mode, then this guarantee does not hold and therefore should not be used for
  deduplication.

- The `close()` method will be called if the `open()` method returns successfully
  (irrespective of the return value), except if the JVM crashes in the middle.

Scala example:
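The sketch below shows one way such a writer might be wired up. It assumes a local SparkSession and a socket source producing one string per row, and the writer only logs what it sees; the object name and source configuration are illustrative, not part of the API.

    import org.apache.spark.sql.{Dataset, ForeachWriter, SparkSession}

    object ForeachWriterExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[2]").appName("foreach-writer-example").getOrCreate()
        import spark.implicits._

        // Assumed input: a socket source producing one string per row.
        val datasetOfString: Dataset[String] = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()
          .as[String]

        val query = datasetOfString.writeStream
          .foreach(new ForeachWriter[String] {

            // Called once per partition and epoch; returning true asks Spark to send the rows.
            def open(partitionId: Long, epochId: Long): Boolean = {
              // e.g. open a connection here
              true
            }

            // Called for every row in the partition, only because open returned true.
            def process(record: String): Unit = {
              println(record)
            }

            // Called after the rows were processed (or a failure occurred while processing them).
            def close(errorOrNull: Throwable): Unit = {
              // e.g. close the connection; errorOrNull is null when no error was seen
            }
          })
          .start()

        query.awaitTermination()
      }
    }

Because the writer instance is serialized and sent to each task, `open`, `process`, and `close` run on the executors, so any output written by them appears in the executor logs (or on the console in local mode).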
A Java implementation follows the same structure: create an anonymous subclass of `ForeachWriter<String>`, override `open(long partitionId, long epochId)`, `process(String value)` and `close(Throwable errorOrNull)`, and pass the instance to `writeStream().foreach(...)`.
Since: 2.0.0