pyspark.sql.functions.regexp_instr

pyspark.sql.functions.regexp_instr(str, regexp, idx=None)

Returns the position of the first substring in str that matches the Java regex regexp and corresponds to the regex group index idx.

New in version 3.5.0.

Parameters
str : Column or column name

target column to work on.

regexp : Column or column name

regex pattern to apply.

idx : Column or int, optional

index of the matched regex group. When omitted, group 0 is used, as the output column names in the examples show.

Returns
Column

the position of the first substring in str that matches the Java regex and corresponds to the regex group index. Positions are 1-based; 0 is returned when there is no match.

Examples

>>> from pyspark.sql import functions as sf
>>> df = spark.createDataFrame([("1a 2b 14m", r"\d+(a|b|m)")], ["str", "regexp"])

Example 1: Returns the position of the first substring in the column named str that matches the regex pattern \d+(a|b|m) (one or more digits followed by ‘a’, ‘b’, or ‘m’).

>>> df.select('*', sf.regexp_instr('str', sf.lit(r'\d+(a|b|m)'))).show()
+---------+----------+--------------------------------+
|      str|    regexp|regexp_instr(str, \d+(a|b|m), 0)|
+---------+----------+--------------------------------+
|1a 2b 14m|\d+(a|b|m)|                               1|
+---------+----------+--------------------------------+

Example 2: Returns the position of the first substring in the column named str that matches the regex pattern \d+(a|b|m) (one or more digits followed by ‘a’, ‘b’, or ‘m’), passing 1 as the regex group index.

>>> df.select('*', sf.regexp_instr('str', sf.lit(r'\d+(a|b|m)'), sf.lit(1))).show()
+---------+----------+--------------------------------+
|      str|    regexp|regexp_instr(str, \d+(a|b|m), 1)|
+---------+----------+--------------------------------+
|1a 2b 14m|\d+(a|b|m)|                               1|
+---------+----------+--------------------------------+

Example 3: Returns the position of the first substring in the column named str that matches the regex pattern stored in the regexp Column.

>>> df.select('*', sf.regexp_instr('str', sf.col("regexp"))).show()
+---------+----------+----------------------------+
|      str|    regexp|regexp_instr(str, regexp, 0)|
+---------+----------+----------------------------+
|1a 2b 14m|\d+(a|b|m)|                           1|
+---------+----------+----------------------------+

Example 4: Returns the position of the first substring in the str Column that matches the regex pattern stored in the column named regexp.

>>> df.select('*', sf.regexp_instr(sf.col("str"), "regexp")).show()
+---------+----------+----------------------------+
|      str|    regexp|regexp_instr(str, regexp, 0)|
+---------+----------+----------------------------+
|1a 2b 14m|\d+(a|b|m)|                           1|
+---------+----------+----------------------------+
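
The position semantics can be sketched in plain Python without a Spark session. This is only a rough analogue: it assumes 1-based positions and a result of 0 when nothing matches, and regexp_instr_sketch is a hypothetical helper, not part of PySpark. Python's re dialect also differs from Java regex in some details, so treat it as an illustration rather than a reference implementation.

```python
import re


def regexp_instr_sketch(s: str, pattern: str) -> int:
    """Hypothetical helper: 1-based start of the first match, or 0 if none.

    Assumption: mirrors the whole-match position shown in the examples;
    it does not model the idx argument.
    """
    m = re.search(pattern, s)
    # re.search reports a 0-based offset, so shift by one
    return m.start() + 1 if m else 0


print(regexp_instr_sketch("1a 2b 14m", r"\d+(a|b|m)"))  # match starts at the first character
print(regexp_instr_sketch("x 2b 14m", r"\d+(a|b|m)"))   # first match is "2b", third character
print(regexp_instr_sketch("no digits", r"\d+(a|b|m)"))  # no match
```

The first call reproduces the value 1 shown in the examples above; the other two inputs are made up to show a later match and a missing match.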