In this article, we will see how to add a constant column to the spark DataFrame.
What is a constant column?
The constant column is said to be a column in a DataFrame that has the same value for every row. There might be several use cases where we might need the constant column in the DataFrame. For example, to add metadata or to set default values.
How to add a constant column?
- Create a new column with the same constant value for every row
- Use the
withColumnmethod to add the new column to the DataFrame
Let us see a simple example where we have a DataFrame with three columns such as name, age, and gender, and add a constant column called the city with the value
San Francisco for every row. Following is how to do it.
# Import the required modules from pyspark.sql.functions import lit # Create a sample DataFrame data = [("Alice", 25, "Female"), ("Bob", 30, "Male"), ("Charlie", 35, "Male")] df = spark.createDataFrame(data, ["name", "age", "gender"]) # Add a constant column df = df.withColumn("city", lit("San Francisco"))
In the above example, we import the
lit function from the
pyspark.sql.functions module by using it we can create a new column with a constant value. We then create a sample DataFrame with the
createDataFrame method and pass it some sample data.
withColumn method we add a new column to the DataFrame. The first argument is the name of the new column, i.e. “city” and the second argument is the value that we want to use for the new column, which is created using the
lit function with the value “San Francisco“.
After adding the new column, we can display the DataFrame using the following method to confirm the change that we made.
# Display the updated DataFrame df.show()
Following is the output:
+-------+---+------+-------------+ | name|age|gender| city| +-------+---+------+-------------+ | Alice| 25|Female|San Francisco| | Bob| 30| Male|San Francisco| |Charlie| 35| Male|San Francisco| +-------+---+------+-------------+
In the above example, we can see the new column “city” has been added to the DataFrame with the constant value “San Francisco” for every row.