If you’re working with data in Google Sheets you’ll soon come across a time when you will need to clean and format phone number data entries.
To clean phone numbers in Google Sheets using the
REGEXEXTRACT()
function extract all the different fields according to the phone number entries and then combine them all into the desired format.
Here’s an example walking step by step through the process of cleaning the data set first, before then moving on to creating the desired phone number format.
Survey Current Phone Number Formats
Suppose I had the following USA phone number data set:
A | |
---|---|
1 | +1 (555) 123 4567 |
2 | (555) 123 4567 |
3 | 555-123-4567 |
4 | 123.4567 |
As you can see from the example data set the data isn’t clean. Anyone and everyone who was entering phone number data into this field was entering the data according to their own personal taste.
To clean this data set up I’m going to use the
REGEXEXTRACT()
formula to help extract the phone groups that should have been in each cell, the groups are as follows:
- Plus symbol and a one to three-digit country code (1 being USA)
- Area code of three digits, usually in brackets
- Three digits representing the prefix number
- Four digits representing the subscriber number
With this format in mind, you can represent the above as a regular expression like so:
\+?\d{1,3}?\D?\(?\d{3}?\)?\D?\d{3}\D?\d{4}
Let’s break the regular expression apart so that you can understand how each expression corresponds to the phone number format:
Regular Expression | Definition |
---|---|
\+?
|
The initial
+
sign denoting a country code. As the
+
sign is a special character in regular expressions it is escaped
\
and as some phone numbers might not have this character I’ve appended the optional character
?
|
\d{1,3}?
|
This expression looks for a digit between 1 and 3 digits long and as all cells may not have the country code digits this expression has also been appended with the optional character
?
|
\D?
|
This expression looks for a non-digit character which can be anything else besides a number. Again, as some fields will not have the country code you will need to make it optional by appending the
?
|
\(?
|
Next, is another expression which is escaped
\
and therefore seeks the exact character
(
and is appended with the optional character
?
as some fields will have the area code but are not wrapped in parentheses.
|
\d{3}?
|
The area code is exactly three digits long and is optional. Some area codes in other countries, like Australia, would have area codes of only two digits, therefore, you would amend this to be
\d{2}?
|
\)?
|
Closing the area code with the exact parentheses and making it optional. |
\D?
|
Optional non-digit character. |
\d{3}
|
Searching for the prefix three-digit number and this expression isn’t optional. This is the bare minimum needed with a phone number. |
\D?
|
Optional non-digit character between the prefix and subscriber number |
\d{4}
|
Finishing with the four-digit subscriber number which is a needed expression. |
Depending upon your phone number data set you may need to remove sections to match your structure. For example, if your phone numbers don’t contain the country code then you will not need the expression
\+?\d{1,3}?\D?
or if you don’t have parentheses around your area code then you can remove the
\(?
and
\)?
sections from the regular expression.
Without a country code or parentheses around an area code your regular expression could look something as simple as this instead:
\d{3}?\D?\d{3}\D?\d{4}
Capture Country, Area, Prefix And Subscriber Sections From Phone Number
The final aspect needed before inserting the above regular expression into the
REGEXEXTRACT()
formula is to set what you want to
capture
from the expression.
In my example, I want to capture all four sections from the phone number fields: the country code, area code, prefix and subscriber numbers. To do this I need to set the captures in the regular expression and the way this is done is by wrapping the expressions in parentheses.
Using my original regular expression here’s how capturing the country code, area code, prefix and subscriber sections would look:
\+?(\d{1,3})?\D?\(?(\d{3})?\)?\D?(\d{3})\D?(\d{4})
Notice the newly inserted parentheses (highlighted in bold).
When this regular expression is inserted into your Google Sheet phone number data, in an adjacent column, with the
REGEXEXTRACT(A1, "\+?(\d{1,3})?\D?(?(\d{3})?)?\D?(\d{3})\D?(\d{4})
“
)
formula it produces the following result:
A | B | C | D | E | |
---|---|---|---|---|---|
1 | +1 (555) 123 4567 | 1 | 555 | 123 | 4567 |
2 | (555) 123 4567 | 555 | 123 | 4567 | |
3 | 555-123-4567 | 555 | 123 | 4567 | |
4 | 123.4567 | 123 | 4567 |
=REGEXEXTRACT(A1,"\+?(\d{1,3})?\D?(?(\d{3})?)?\D?(\d{3})\D?(\d{4})")
inserted into cell
B1
and copied down
As you can see the regular expression extracts each section from the phone numbers in column A and inserts their section into the 4 columns alongside each column representing one of the captured sections.
Concatenate Phone Number Into Custom String
To translate this into your custom phone number format it would now be a case of adding another column and where a value doesn’t exist in a column to replace it with what should have been inserted.
For example, if these numbers are all USA phone numbers then for column B we could insert the string
"+1"
as a substitute (as 1 represents the country code for USA); if column B is meant to be a local area code of
555
then this would be the substituted value should that
column be blank
.
Therefore, wrapping this all up into a formula and applying spacing between each section you could insert the following formula in column F:
=CONCATENATE("+",IF(LEN(B1),B1,"1")," (",IF(LEN(C1),C1,"555"),") ",D1," ",E1)
Here is the result in our working spreadsheet:
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
1 | +1 (555) 123 4567 | +1 | 555 | 123 | 4567 | +1 (555) 123 4567 |
2 | (555) 123 4567 | 555 | 123 | 4567 | +1 (555) 123 4567 | |
3 | 555-123-4567 | 555 | 123 | 4567 | +1 (555) 123 4567 | |
4 | 123.4567 | 123 | 4567 | +1 (555) 123 4567 |
CONCATENATE
to construct your own phone number format
As you can see the
CONCATENATE()
function has allowed you to design your own custom phone number format and the result is a neatly designed string from the “dirty data” of phone numbers in column A.
Phone Number As Custom Number Format
An alternative to constructing your own custom phone number string using the
CONCATENATE()
formula from the result of the
REGEXEXTRACT()
function is to create a phone number as a number and to apply a custom number format for that column.
This may prove handy should the data be issued to others and they prefer different formats to phone numbers.
First step with this approach is to apply the same concatenation technique but to convert it to a number using the
VALUE()
formula. This would look something like this in column F instead:
=VALUE(CONCATENATE(IF(LEN(B1),B1,"1"),IF(LEN(C1),C1,"555"),D1,E1))
Notice the formula has a slight difference from above: the removal of the
"+"
country code prefix, spacing and parentheses. Removing these elements will help make it easier for the
VALUE()
function to translate the string to a number, if you do get an error check these elements have been removed from your
CONCATENATE()
function.
The result of the above formula would now look something like this in your spreadsheet:
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
1 | +1 (555) 123 4567 | +1 | 555 | 123 | 4567 | 15551234567 |
2 | (555) 123 4567 | 555 | 123 | 4567 | 15551234567 | |
3 | 555-123-4567 | 555 | 123 | 4567 | 15551234567 | |
4 | 123.4567 | 123 | 4567 | 15551234567 |
VALUE()
and
CONCATENATE()
As you can see the result of this combination of formulas produces one large number.
Change Phone Number Format Using Format Menu
To change a number into a phone number format you can use the
Format
menu item and navigate to Number and then
Custom Number Format
.
The modal window that pops up will then need a format for how you would like to display the number. If you stick with the common format of
+# (###) ### ###
then you would type into the input box next to the Apply button that format, such that it looks like this:
The
#
symbol in the
Custom Number Format
field represents a number.
Once you apply this to your column with phone numbers you will see column F change to this:
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
1 | +1 (555) 123 4567 | +1 | 555 | 123 | 4567 | +1 (555) 123 4567 |
2 | (555) 123 4567 | 555 | 123 | 4567 | +1 (555) 123 4567 | |
3 | 555-123-4567 | 555 | 123 | 4567 | +1 (555) 123 4567 | |
4 | 123.4567 | 123 | 4567 | +1 (555) 123 4567 |
Then should a user prefer a different format they can apply their own Custom Number Format field accordingly.
For example, if a user prefers square brackets and a hyphen between the prefix and subscriber numbers then their format will look something like this:
+# [###] ###-####
Summary
Cleaning phone numbers in a spreadsheet, like Google Sheets, is made easier with the handy
REGEXEXTRACT()
function. This function enables users to apply one regular expression to capture all the phone number entries and then by using
CONCATENATE()
can then change these captures into a constant phone number format.
Check our other post on this topic where I explore how you can use Python and regular expressions to check phone numbers too.