How does one even know they have duplicates ? or to what extent ? You can use this SQL statement right here:
|title, url, count(*)|
|count(title) > 1|
The key here is that we are using the “group by” clause to aggregate a bunch of data. And after that we are using the “having” filter clause, to add the filter “count(title) > 1” which is just saying “title found more then once”
Now that we have made sure that we actually do have duplicates, lets get down to deleting them.
|FROM jobtransparncyprod.JobPostings dupes, jobtransparncyprod.JobPostings fullTable|
|WHERE dupes.url = fullTable.url|
|AND dupes.title = fullTable.title|
|AND dupes.id > fullTable.id|
The key here is that we are using aliases rather than joins or something else. So you can think of “dupes” and “fulltable” as variables. First we set the values of these variables, then we use the where clause to filter the things we are looking for. At this point if we just ran the query, we would just end up deleting our whole table. Therefor we have a final “and” where we specify that we only want to select the id that greater out of the two, rows that were in “dupes” and “fulltable”.