How to save result of printSchema to a file in PySpark
Andrew Henderson
I have used df.printSchema() in PySpark, and it prints the schema as a tree structure. Now I need to save that output in a variable or a text file.
I tried the methods below, but they didn't work.
v = str(df.printSchema())
print(v)
#and
df.printSchema().saveAsTextFile(<path>)
I need the saved schema in the format below:
|-- COVERSHEET: struct (nullable = true)
|    |-- ADDRESSES: struct (nullable = true)
|    |    |-- ADDRESS: struct (nullable = true)
|    |    |    |-- _VALUE: string (nullable = true)
|    |    |    |-- _city: string (nullable = true)
|    |    |    |-- _primary: long (nullable = true)
|    |    |    |-- _state: string (nullable = true)
|    |    |    |-- _street: string (nullable = true)
|    |    |    |-- _type: string (nullable = true)
|    |    |    |-- _zip: long (nullable = true)
|    |-- CONTACTS: struct (nullable = true)
|    |    |-- CONTACT: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- _VALUE: string (nullable = true)
|    |    |    |    |-- _name: string (nullable = true)
|    |    |    |    |-- _type: string (nullable = true)
2 Answers
You need treeString, which for some reason I couldn't find in the Python API, so it has to be called through the underlying Java object:
# v will be a string
v = df._jdf.schema().treeString()
You can convert it to an RDD and use saveAsTextFile:
sc.parallelize([v]).saveAsTextFile(...)
Or use a Python-specific API to write the string to a file.
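For the plain-Python route, a minimal sketch could look like the following. It assumes the tree string has already been obtained as v above; here a made-up sample string stands in for it so the snippet runs without Spark, and the helper name save_schema_text and the output file name are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def save_schema_text(tree_string: str, path: str) -> None:
    """Write the schema tree string to a local text file as-is."""
    Path(path).write_text(tree_string, encoding="utf-8")


# Stand-in for: v = df._jdf.schema().treeString()
# (sample content invented for illustration only)
sample_tree = (
    "root\n"
    " |-- COVERSHEET: struct (nullable = true)\n"
    " |    |-- ADDRESSES: struct (nullable = true)\n"
)

# Hypothetical destination; any writable local path works.
out_path = os.path.join(tempfile.gettempdir(), "schema.txt")
save_schema_text(sample_tree, out_path)
```

Unlike saveAsTextFile, which writes a directory of part files, this produces a single file on the driver's local filesystem.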
You can also use the following:
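# schema is not defined in this answer; presumably schema = df.schema
temp_rdd = sc.parallelize(schema)
# coalesce(1) keeps the output in a single partition
temp_rdd.coalesce(1).saveAsPickleFile("s3a://path/to/destination_schema.pickle")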