pyspark.sql.functions.regexp_instr

pyspark.sql.functions.regexp_instr(str, regexp, idx=None)

Returns the position of the first substring in str that matches the Java regex regexp and corresponds to the regex group index.

New in version 3.5.0.

Parameters

str (Column or column name): target column to work on.
regexp (Column or column name): regex pattern to apply.
idx (Column or int, optional): matched group id.

Returns

Column: the position of the first substring in str that matches the Java regex and corresponds to the regex group index.

Examples

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("1a 2b 14m", r"\d+(a|b|m)")], ["str", "regexp"])

Example 1: Returns the position of the first substring in the str column (given by name) that matches the regex pattern \d+(a|b|m) (one or more digits followed by ‘a’, ‘b’, or ‘m’).

>>> df.select('*', sf.regexp_instr('str', sf.lit(r'\d+(a|b|m)'))).show()
+---------+----------+--------------------------------+
|      str|    regexp|regexp_instr(str, \d+(a|b|m), 0)|
+---------+----------+--------------------------------+
|1a 2b 14m|\d+(a|b|m)|                               1|
+---------+----------+--------------------------------+

Example 2: Returns the position of the first substring in the str column (given by name) that matches the regex pattern \d+(a|b|m) (one or more digits followed by ‘a’, ‘b’, or ‘m’), with the group index idx set to 1.

>>> df.select('*', sf.regexp_instr('str', sf.lit(r'\d+(a|b|m)'), sf.lit(1))).show()
+---------+----------+--------------------------------+
|      str|    regexp|regexp_instr(str, \d+(a|b|m), 1)|
+---------+----------+--------------------------------+
|1a 2b 14m|\d+(a|b|m)|                               1|
+---------+----------+--------------------------------+

Example 3: Returns the position of the first substring in the str column (given by name) that matches the regex pattern held in the regexp Column.

>>> df.select('*', sf.regexp_instr('str', sf.col("regexp"))).show()
+---------+----------+----------------------------+
|      str|    regexp|regexp_instr(str, regexp, 0)|
+---------+----------+----------------------------+
|1a 2b 14m|\d+(a|b|m)|                           1|
+---------+----------+----------------------------+

Example 4: Returns the position of the first substring in the str Column that matches the regex pattern held in the regexp column (given by name).

>>> df.select('*', sf.regexp_instr(sf.col("str"), "regexp")).show()
+---------+----------+----------------------------+
|      str|    regexp|regexp_instr(str, regexp, 0)|
+---------+----------+----------------------------+
|1a 2b 14m|\d+(a|b|m)|                           1|
+---------+----------+----------------------------+
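
The result of 1 in the examples above reflects 1-based positions: the first match, "1a", starts at the first character of the string. As a rough illustration, the same idea can be sketched in plain Python without Spark. The helper name regexp_instr_sketch is hypothetical and not part of PySpark; the sketch uses Python's re module rather than the Java regex engine, and it assumes that a string with no match yields 0.

```python
import re


def regexp_instr_sketch(text, pattern):
    """Hypothetical stand-in: 1-based start of the first match.

    Assumption: 0 is returned when the pattern does not match.
    Uses Python's re module, not Java regex, so behaviour may
    differ for patterns more complex than the one shown here.
    """
    match = re.search(pattern, text)
    return match.start() + 1 if match else 0


pattern = r"\d+(a|b|m)"
print(regexp_instr_sketch("1a 2b 14m", pattern))  # 1: match starts at the first character
print(regexp_instr_sketch("x 2b 14m", pattern))   # 3: first match "2b" starts at the third character
print(regexp_instr_sketch("no digits", pattern))  # 0: no match (assumed convention)
```

For anything beyond simple patterns like this one, the Spark function itself should be treated as the source of truth, since Java and Python regex syntax are not identical.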