Checkpoint 1
>>> val df = spark.read.option("inferSchema", "true").csv("aadhaar_data.csv")
      .toDF("date", "registrar", "private_agency", "state", "district",
            "sub_district", "pincode", "gender", "age", "aadhaar_generated",
            "rejected", "mobile_number", "email_id")
>>> df.createOrReplaceTempView("data")
Checkpoint 2
>>> df.printSchema()
>>> df.select("registrar").distinct().show()
>>> df.select("registrar").distinct().count()
3. Find the number of states, districts in each state and sub-districts in each
district.
>>> df.select("state").distinct().count()
>>> spark.sql("SELECT state, COUNT(DISTINCT district) AS district_count
FROM data GROUP BY state ORDER BY district_count DESC").show()
>>> spark.sql("SELECT district, COUNT(DISTINCT sub_district) AS
sub_district_count FROM data GROUP BY district ORDER BY
sub_district_count DESC").show()
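The same distinct counts can also be computed with the DataFrame API instead of SQL. A minimal sketch, assuming the same `df` and a spark-shell session (where `$"..."` column syntax is already in scope):

```scala
// DataFrame-API equivalent of the SQL queries above.
import org.apache.spark.sql.functions.countDistinct

// Distinct districts per state, largest first.
df.groupBy("state")
  .agg(countDistinct("district").as("district_count"))
  .orderBy($"district_count".desc)
  .show()

// Distinct sub-districts per district, largest first.
df.groupBy("district")
  .agg(countDistinct("sub_district").as("sub_district_count"))
  .orderBy($"sub_district_count".desc)
  .show()
```

Note that `COUNT(DISTINCT ...)` / `countDistinct` is what the task calls for: a plain `COUNT(district)` would count rows, not distinct districts.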
Checkpoint 3
>>> val generated = df.groupBy("district").sum("aadhaar_generated")
>>> val rejected = df.groupBy("district").sum("rejected")
>>> val concat = generated.join(rejected, Seq("district"))
>>> val final_op = concat.withColumn("no_of_enrolments",
      $"sum(aadhaar_generated)" + $"sum(rejected)")
>>> println("Top 3 districts where enrolment numbers are maximum, along with the number of enrolments:")
>>> final_op.orderBy($"no_of_enrolments".desc).show(3, false)
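Joining the two grouped DataFrames can be avoided entirely: a single `agg` over one `groupBy` computes both sums in one pass. A sketch, assuming the same `df` in a spark-shell session:

```scala
// Checkpoint 3 in a single aggregation: total enrolments
// (generated + rejected) per district, ranked descending.
import org.apache.spark.sql.functions.sum

val enrolments = df.groupBy("district")
  .agg((sum("aadhaar_generated") + sum("rejected")).as("no_of_enrolments"))

enrolments.orderBy($"no_of_enrolments".desc).show(3, false)
```

Beyond being shorter, this avoids the pitfall of pairing rows from two separate DataFrames by `monotonically_increasing_id()`, which is not guaranteed to assign matching ids to corresponding rows.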
Checkpoint 4
>>> df.select("pincode").distinct().show()
Checkpoint 5
1. Find the top 3 states where the percentage of Aadhaar cards being
generated for males is the highest.
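One way to approach this is a conditional sum per state: the male share of generated cards is the sum of `aadhaar_generated` on rows where `gender` is male, divided by the total generated. A sketch, not the canonical solution, assuming the same `df` and that `gender` is coded as "M"/"F" in this dataset:

```scala
// Percentage of generated Aadhaar cards that went to males, per state,
// then the top 3 states by that percentage.
import org.apache.spark.sql.functions.{sum, when, round}

val malePct = df.groupBy("state")
  .agg(round(
    sum(when($"gender" === "M", $"aadhaar_generated").otherwise(0)) * 100.0 /
      sum($"aadhaar_generated"),
    2).as("male_pct"))

malePct.orderBy($"male_pct".desc).show(3, false)
```

If the dataset uses different gender codes, the `=== "M"` comparison would need to be adjusted accordingly.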