docs - regexextract (#25728)

Unverified commit 033f47da · authored 2 years ago by Natalie · committed by GitHub 2 years ago
--- a/docs/questions/query-builder/expressions-list.md
+++ b/docs/questions/query-builder/expressions-list.md
@@ -43,7 +43,7 @@ For an introduction to expressions, check out [Writing expressions in the notebo
  - [log](#log)
  - [lower](#lower)
  - [power](#power)
-  - [regexextract](#regexextract)
+  - [regexextract](./expressions/regexextract.md)
  - [replace](#replace)
  - [righttrim](#righttrim)
  - [round](#round)
@@ -224,7 +224,7 @@ Example: `case([Weight] > 200, "Large", [Weight] > 150, "Medium", "Small")` If a

 ### ceil

-Rounds a decimal up (ciel as in ceiling).
+Rounds a decimal up (ceil as in ceiling).

 Syntax: `ceil(column)`.

@@ -362,7 +362,7 @@ Databases that don't support `power`: SQLite.

 Related: [exp](#exp).

-### regexextract
+### [regexextract](./expressions/regexextract.md)

 Extracts matching substrings according to a regular expression.

@@ -372,7 +372,7 @@ Example: `regexextract([Address], "[0-9]+")`.

 Databases that don't support `regexextract`: H2, SQL Server, SQLite.

-Related: [contains](#contains).
+Related: [contains](#contains), [substring](#substring).

 ### replace

@@ -430,7 +430,7 @@ Syntax: `substring(text, position, length)`

 Example: `substring([Title], 0, 10)` returns the first 11 letters of a string (the string index starts at position 0).

-Related: [replace](#replace).
+Related: [regexextract](#regexextract), [replace](#replace).

 ### trim


--- a/docs/questions/query-builder/expressions/regexextract.md
+++ b/docs/questions/query-builder/expressions/regexextract.md
+---
+title: Regexextract
+---
+
+# Regexextract
+
+`regexextract` uses [regular expressions (regex)](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions) to get a specific part of your text.
+
+`regexextract` is ideal for text that has little to no structure, like URLs or freeform survey responses. If you're working with strings in predictable formats like SKU numbers, IDs, or other types of codes, check out the simpler [substring](../expressions/substring.md) expression instead.
+
+Use `regexextract` to create custom columns with shorter, more readable labels for things like:
+
+- filter dropdown menus, 
+- chart labels, or
+- embedding parameters.
+
+| Syntax                                                        | Example                                 |
+|---------------------------------------------------------------|-----------------------------------------|
+| `regexextract(text, regular_expression)`                      | `regexextract("regexextract", "ex(.*)")`|
+| Gets a specific part of your text using a regular expression. | "extract"                               |
+
+## Searching and cleaning text
+
+Let's say that you have web data with a lot of different URLs, and you want to map each URL to a shorter, more readable campaign name.
+
+| URL                                                   | Campaign Name |
+|-------------------------------------------------------|---------------|
+| https://www.metabase.com/docs/?utm_campaign=alice     | alice         |
+| https://www.metabase.com/learn/?utm_campaign=neo      | neo           |
+| https://www.metabase.com/glossary/?utm_campaign=candy | candy         |
+
+You can create a custom column **Campaign Name** with the expression:
+
+```
+regexextract([URL], "^[^?#]+\?utm_campaign=(.*)")
+```
+
+Here, the regex pattern [`^[^?#]+\?` matches everything in the URL up to and including the `?` that starts the query string](https://www.oreilly.com/library/view/regular-expressions-cookbook/9780596802837/ch07s13.html). You can replace `utm_campaign=` with whatever query parameter you like. At the end of the regex pattern, the [capturing group](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Groups_and_Backreferences) `(.*)` gets all of the characters that appear after `utm_campaign=`. Because the parameter name comes right after `\?`, the pattern only matches when `utm_campaign` is the first query parameter, and the capturing group also picks up any parameters that follow it.
+
+Now, you can use **Campaign Name** in places where you need clean labels, such as [filter dropdown menus](../../../dashboards/filters.md#choosing-between-a-dropdown-or-autocomplete-for-your-filter), [charts](../../sharing/visualizing-results.md), and [embedding parameters](../../../embedding/signed-embedding-parameters.md).
+
+## Accepted data types
+
+| [Data type](https://www.metabase.com/learn/databases/data-types-overview#examples-of-data-types) | Works with `regexextract`  |
+| ----------------------- | -------------------- |
+| String                  | ✅                   |
+| Number                  | ❌                   |
+| Timestamp               | ❌                   |
+| Boolean                 | ❌                   |
+| JSON                    | ❌                   |
+
+## Limitations
+
+Regex can be a dark art. You have been warned.
+
+`regexextract` is not supported on H2 (including the Metabase Sample Database), SQL Server, and SQLite.
+
+## Related functions
+
+This section covers functions and formulas that work the same way as the Metabase `regexextract` expression, with notes on how to choose the best option for your use case.
+
+**[Metabase expressions](../expressions-list.md)**
+
+- [substring](#substring)
+
+**Other tools**
+
+- [SQL](#sql)
+- [Spreadsheets](#spreadsheets)
+- [Python](#python)
+
+### Substring
+
+Use [substring](../expressions/substring.md) when you want to search text that has a consistent format (the same number of characters, and the same relative order of those characters). 
+
+For example, you wouldn't be able to use `substring` to get the query parameter from the [URL sample data](#searching-and-cleaning-text), because the URL paths and the parameter values both have variable lengths.
+
+But if you wanted to pull out everything after `https://www.` and before `.com`, you could do that with either:
+
+```
+substring([URL], 13, 8)
+```
+
+or
+
+```
+regexextract([URL], "^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/.\n]+)")
+```
+
+### SQL
+
+When you run a question using the [notebook editor](https://www.metabase.com/glossary/notebook_editor), Metabase will convert your graphical query settings (filters, summaries, etc.) into a query, and run that query against your database to get your results.
+
+If our [sample data](#searching-and-cleaning-text) is stored in a PostgreSQL database:
+
+```sql
+SELECT
+    url,
+    SUBSTRING(url, '^[^?#]+\?utm_campaign=(.*)') AS campaign_name
+FROM follow_the_white_rabbit
+```
+
+is equivalent to the Metabase `regexextract` expression:
+
+```
+regexextract([URL], "^[^?#]+\?utm_campaign=(.*)")
+```
+
+### Spreadsheets
+
+If our [sample data](#searching-and-cleaning-text) is in a spreadsheet where "URL" is in column A, the spreadsheet function
+
+```
+regexextract(A2, "^[^?#]+\?utm_campaign=(.*)")
+```
+
+uses pretty much the same syntax as the Metabase expression:
+
+```
+regexextract([URL], "^[^?#]+\?utm_campaign=(.*)")
+```
+
+### Python
+
+Assuming the [sample data](#searching-and-cleaning-text) is in a dataframe called `df` with a `URL` column,
+
+```
+df['Campaign Name'] = df['URL'].str.extract(r'^[^?#]+\?utm_campaign=(.*)')
+```
+
+does the same thing as the Metabase `regexextract` expression:
+
+```
+regexextract([URL], "^[^?#]+\?utm_campaign=(.*)")
+```
+
+## Further reading
+
+- [Custom expressions documentation](../expressions.md)
+- [Custom expressions tutorial](https://www.metabase.com/learn/questions/)
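
Review note (not part of the commit): the two regex patterns in the new `regexextract.md` page can be sanity-checked outside Metabase. The sketch below is only an approximation, assuming Python's `re` module treats these patterns the same way your database's regex engine does, which may not hold for every database. The helper `first_group` and the variable names are hypothetical. The URLs are the sample rows from the page.

```python
import re

# Sample URLs copied from the "Searching and cleaning text" table.
urls = [
    "https://www.metabase.com/docs/?utm_campaign=alice",
    "https://www.metabase.com/learn/?utm_campaign=neo",
    "https://www.metabase.com/glossary/?utm_campaign=candy",
]

# Patterns copied verbatim from the regexextract examples in the page.
CAMPAIGN = re.compile(r"^[^?#]+\?utm_campaign=(.*)")
DOMAIN = re.compile(r"^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/.\n]+)")


def first_group(pattern, text):
    """Return the first capturing group, or None if the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


campaigns = [first_group(CAMPAIGN, url) for url in urls]
domains = [first_group(DOMAIN, url) for url in urls]

# substring([URL], 13, 8) starts counting at 1, so the Python slice is [12:20].
sliced = [url[12:20] for url in urls]

print(campaigns)
print(domains)
print(sliced)
```

For these three rows, the fixed-width slice and the domain regex agree because every URL starts with `https://www.` and the domain is always eight characters long. A URL without the `utm_campaign` parameter makes `first_group` return `None` for the campaign pattern.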
--- a/docs/questions/query-builder/expressions/substring.md
+++ b/docs/questions/query-builder/expressions/substring.md
@@ -66,7 +66,7 @@ substring([Mission ID], (1 + length([Mission ID]) - 3), 3)

 ## Limitations

-`substring` extracts text by counting characters from left to right. If you need to extract text based on some more complicated logic, try [`regexextract`](../expressions-list.md#regexextract).
+`substring` extracts text by counting a fixed number of characters. If you need to extract text based on some more complicated logic, try [`regexextract`](../expressions-list.md#regexextract).

 And if you only need to clean up extra whitespace around your text, you can use the [`trim`](../expressions-list.md#trim), [`lefttrim`](../expressions-list.md#lefttrim), or [`righttrim`](../expressions-list.md#righttrim) expressions instead.

@@ -74,15 +74,35 @@ And if you only need to clean up extra whitespace around your text, you can use

 This section covers functions and formulas that work the same way as the Metabase `substring` expression, with notes on how to choose the best option for your use case.

+**[Metabase expressions](../expressions-list.md)**
+
+- [regexextract](#regexextract)
+
+**Other tools**
+
 - [SQL](#sql)
 - [Spreadsheets](#spreadsheets)
 - [Python](#python)

+### Regexextract
+
+Use [regexextract](./regexextract.md) if you need to extract text based on more specific rules. For example, you could get the agent ID with a regex pattern that finds the last occurrence of "00" (and everything after it):
+
+```
+regexextract([Mission ID], ".+(00.+)$")
+```
+
+should return the same result as
+
+```
+substring([Mission ID], 9, 3)
+```
+
 ### SQL

 When you run a question using the [notebook editor](https://www.metabase.com/glossary/notebook_editor), Metabase will convert your graphical query settings (filters, summaries, etc.) into a query, and run that query against your database to get your results.

-If our [sample data](#getting-a-substring-from-the-left) is stored in a SQL database:
+If our [sample data](#getting-a-substring-from-the-left) is stored in a PostgreSQL database:

 ```sql
 SELECT