In this article, we will see how to add a constant column to a Spark DataFrame using PySpark.

What is a constant column?

A constant column is a column in a DataFrame that holds the same value in every row. Such a column is useful in several situations, for example to attach metadata to the rows or to set a default value.

How to add a constant column?

  1. Create a new column with the same constant value for every row
  2. Use the withColumn method to add the new column to the DataFrame

Let us look at a simple example. We have a DataFrame with three columns (name, age, and gender), and we add a constant column called city with the value San Francisco for every row. Here is how to do it.

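# Import the required modules
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Create a SparkSession (the pyspark shell and most notebooks already provide `spark`)
spark = SparkSession.builder.appName("constant-column-example").getOrCreate()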

# Create a sample DataFrame
data = [("Alice", 25, "Female"), ("Bob", 30, "Male"), ("Charlie", 35, "Male")]
df = spark.createDataFrame(data, ["name", "age", "gender"])

# Add a constant column
df = df.withColumn("city", lit("San Francisco"))

In the above example, we import the lit function from the pyspark.sql.functions module. The lit function creates a column expression from a constant value. We then create a sample DataFrame by passing some sample data and the column names to the createDataFrame method.

Using the withColumn method, we add a new column to the DataFrame. The first argument is the name of the new column, i.e. “city”, and the second argument is the column expression that supplies its values, which we create by calling the lit function with the value “San Francisco”. Because withColumn returns a new DataFrame rather than modifying the existing one, we assign the result back to df.

After adding the new column, we can display the DataFrame with the show method to confirm the change that we made.

# Display the updated DataFrame
df.show()

Following is the output:

+-------+---+------+-------------+
|   name|age|gender|         city|
+-------+---+------+-------------+
|  Alice| 25|Female|San Francisco|
|    Bob| 30|  Male|San Francisco|
|Charlie| 35|  Male|San Francisco|
+-------+---+------+-------------+

In the above example, we can see the new column “city” has been added to the DataFrame with the constant value “San Francisco” for every row.
