pyspark.sql.functions.shuffle
- pyspark.sql.functions.shuffle(col, seed=None)
Array function: Generates a random permutation of the given array.
New in version 2.4.0.
Changed in version 3.4.0: Supports Spark Connect.
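- Parameters
col : Column or str
The name of the column or expression to be shuffled.
seed : Column or int, optional
Seed value for the random generator.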
- Returns
Column
A new column that contains an array of elements in random order.
Notes
The shuffle function is non-deterministic: the order of the output array can differ from one execution to the next, as in Example 4, where no seed is given.
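The role of the seed can be illustrated outside Spark with a minimal plain-Python sketch. This is only an analogy built on the standard `random` module: `shuffled_copy` is a hypothetical helper, not part of PySpark, and the permutations it produces will not match those returned by `sf.shuffle` for the same seed.

```python
import random


def shuffled_copy(values, seed=None):
    """Return a randomly permuted copy of `values`.

    Hypothetical helper for illustration only; this is not how Spark
    implements shuffle, and its output will differ from Spark's.
    """
    rng = random.Random(seed)  # seed=None lets Python pick its own entropy
    result = list(values)      # copy so the input list is left untouched
    rng.shuffle(result)        # in-place random permutation of the copy
    return result


data = [1, 20, None, 5]
first = shuffled_copy(data, seed=123)
second = shuffled_copy(data, seed=123)

# With the same seed, both calls yield the same permutation.
print(first == second)
# The result keeps every element of the input, including None.
print(len(first) == len(data))
```

Calling `shuffled_copy(data)` without a seed may give a different order on each call, which mirrors the behaviour described in the note above.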
Examples
Example 1: Shuffling a simple array
>>> import pyspark.sql.functions as sf
>>> df = spark.sql("SELECT ARRAY(1, 20, 3, 5) AS data")
>>> df.select("*", sf.shuffle(df.data, sf.lit(123))).show()
+-------------+-------------+
|         data|shuffle(data)|
+-------------+-------------+
|[1, 20, 3, 5]|[5, 1, 20, 3]|
+-------------+-------------+
Example 2: Shuffling an array with null values
>>> import pyspark.sql.functions as sf
>>> df = spark.sql("SELECT ARRAY(1, 20, NULL, 5) AS data")
>>> df.select("*", sf.shuffle(sf.col("data"), 234)).show()
+----------------+----------------+
|            data|   shuffle(data)|
+----------------+----------------+
|[1, 20, NULL, 5]|[NULL, 5, 20, 1]|
+----------------+----------------+
Example 3: Shuffling an array with duplicate values
>>> import pyspark.sql.functions as sf
>>> df = spark.sql("SELECT ARRAY(1, 2, 2, 3, 3, 3) AS data")
>>> df.select("*", sf.shuffle("data", 345)).show()
+------------------+------------------+
|              data|     shuffle(data)|
+------------------+------------------+
|[1, 2, 2, 3, 3, 3]|[2, 3, 3, 1, 2, 3]|
+------------------+------------------+
Example 4: Shuffling an array with a random seed
>>> import pyspark.sql.functions as sf
>>> df = spark.sql("SELECT ARRAY(1, 2, 2, 3, 3, 3) AS data")
>>> df.select("*", sf.shuffle("data")).show()
+------------------+------------------+
|              data|     shuffle(data)|
+------------------+------------------+
|[1, 2, 2, 3, 3, 3]|[3, 3, 2, 3, 2, 1]|
+------------------+------------------+