Hi, we're Apify. The Apify platform gives you access to 2,000+ tools to extract data from popular websites. And lots of our scrapers use glob patterns. Check us out.
When working with files or text patterns, two popular methods are often used: glob and regex. Let's take a closer look at globs, understand their purpose, and see how they compare with regex.
What is a glob?
Globbing, in computing, refers to the process of finding filenames based on a pattern. Unlike regular expressions (regex), which can be complex and are used for pattern matching within strings, globs are primarily used to match filenames in filesystems.
The term "glob" is short for "global," a nod to its roots in early Unix systems. Originally, "globbing" was part of the "global command" functionality used for expanding shell patterns. In these early systems, the shell would replace a wildcard pattern with a sorted list of filenames that matched the pattern before executing the command. For instance, a command like `ls *.txt` would first expand to `ls document1.txt document2.txt` before the `ls` command was actually run.
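The expansion step described above can be imitated in Python. This is only a sketch: the temporary directory and file names are hypothetical, and Python's `glob` module stands in for the shell's own expansion logic.

```python
import glob
import os
import tempfile

# Build a throwaway directory with hypothetical files so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("document2.txt", "document1.txt", "image1.jpg"):
        open(os.path.join(tmp, name), "w").close()

    # Expand the pattern first, as a shell would. glob.glob() does not
    # guarantee an order, so sort the result explicitly.
    matches = sorted(glob.glob(os.path.join(tmp, "*.txt")))
    expanded_args = [os.path.basename(path) for path in matches]

# The command only ever sees the expanded list, never the wildcard.
command = ["ls"] + expanded_args
print(command)  # ['ls', 'document1.txt', 'document2.txt']
```

The point of the sketch is the ordering of events: the pattern is resolved to concrete filenames before the command receives its arguments.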
Over time, the concept of globbing evolved. While initially limited to shell commands, globbing capabilities have since been built into many programming languages and utilities, allowing for more versatile pattern-based file searching. The term "glob" itself has become synonymous with pattern-based file searching, well beyond its original shell-specific context.
Why are glob patterns useful?
Globs are a straightforward and user-friendly approach to file handling and searching. They let you specify and match filenames based on a set pattern, which is especially useful when you don't know the exact filename. A few simple wildcards can make a file search much faster and easier, eliminating the need for intricate commands or patterns.
What is a glob pattern?
A glob pattern is a string of characters used to match filenames. It often includes wildcards like * (matches any sequence of characters) and ? (matches a single character).
What is glob style matching?
Glob style matching employs simple wildcards and patterns to match filenames. While it doesn't have the power or complexity of regex, it serves its purpose exceptionally well for simple, pattern-based file searches. For instance, using a wildcard pattern like 'data*.csv' would match filenames that start with 'data' and end with '.csv'.
How does globbing work?
Globbing works by comparing the glob pattern to filenames in a directory. The wildcards in the pattern serve as placeholders that can represent multiple characters, allowing for flexible and broad searches.
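As a rough illustration of that comparison step, the sketch below lists a directory and tests each name against the pattern using Python's standard `fnmatch` module. The helper name `simple_glob` and the file names are hypothetical, chosen only for this example.

```python
import fnmatch
import os
import tempfile


def simple_glob(directory, pattern):
    """Return the names in `directory` that match `pattern`, sorted."""
    return sorted(
        name
        for name in os.listdir(directory)
        if fnmatch.fnmatchcase(name, pattern)  # case-sensitive comparison
    )


with tempfile.TemporaryDirectory() as tmp:
    for name in ("data1.csv", "data2.csv", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    csv_files = simple_glob(tmp, "data*.csv")

print(csv_files)  # ['data1.csv', 'data2.csv']
```

Note that this sketch treats every name the same way. A shell, or Python's own `glob.glob`, additionally skips hidden files (names starting with a dot) unless the pattern asks for them explicitly.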
How do you write a glob?
Writing a glob involves specifying characters and wildcards to form a pattern. For instance:
Wildcard | Description |
---|---|
* | Matches any sequence of characters. |
? | Matches a single character. |
[abc] | Matches any one of the characters a, b, or c. |
Wildcards and example matches
Wildcards are special characters in glob patterns that stand in for other characters. They are the backbone of the pattern-matching process, allowing you to specify broad or specific criteria for your searches.
Wildcard patterns, when put into practice, can match a variety of filenames. For instance, the wildcard `*` can match any sequence of characters, making it one of the most commonly used wildcards in glob patterns.
The `?` wildcard is unique in that it matches any single character. It's particularly useful when you know the structure of a filename but might be unsure about one or two characters. This wildcard gives you precision without sacrificing flexibility.
* Wildcard
Description: The asterisk `*` is the most versatile wildcard. It matches any sequence of characters, including none.
Example: The pattern `doc*.txt` can match filenames such as `document.txt`, `doc1.txt`, and `docs.txt`.
Matches: Any sequence of characters.
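To check the `*` behavior yourself, here is a small sketch using `fnmatch.fnmatchcase` from Python's standard library. The extra names (`doc.txt`, `readme.txt`, `doc.pdf`) are hypothetical additions to show the empty match and two non-matches.

```python
from fnmatch import fnmatchcase

names = ["document.txt", "doc1.txt", "docs.txt", "doc.txt", "readme.txt", "doc.pdf"]

# "*" may stand for zero characters, so "doc.txt" matches as well.
star_matches = [name for name in names if fnmatchcase(name, "doc*.txt")]
print(star_matches)
# ['document.txt', 'doc1.txt', 'docs.txt', 'doc.txt']
```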
? Wildcard
Description: The question mark ? matches any single character. It's ideal when the overall structure of a filename is known, but one character may vary.
Example: The pattern `file?.jpg` can match filenames like `file1.jpg` and `file2.jpg`, but not `file10.jpg`.
Matches: A single character.
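The same check for `?`, again as a sketch with Python's `fnmatch` module. The names `file.jpg` and `fileA.jpg` are hypothetical extras that show `?` requires exactly one character, and that the character need not be a digit.

```python
from fnmatch import fnmatchcase

names = ["file1.jpg", "file2.jpg", "file10.jpg", "file.jpg", "fileA.jpg"]

# "?" consumes exactly one character: not zero ("file.jpg"), not two ("file10.jpg").
question_matches = [name for name in names if fnmatchcase(name, "file?.jpg")]
print(question_matches)  # ['file1.jpg', 'file2.jpg', 'fileA.jpg']
```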
[...] Wildcard
Description: Square brackets allow for specific character matches. The pattern matches any one of the characters enclosed within the brackets.
Example: The pattern `file[123].txt` will match `file1.txt`, `file2.txt`, and `file3.txt`, but not `file4.txt`.
Matches: One of the characters within the brackets.
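A short sketch of bracket matching with Python's `fnmatch` module follows. The range form `[1-3]` and the negated form `[!123]` are not covered above; they are shown here only because `fnmatch` happens to accept them, and other glob implementations may differ.

```python
from fnmatch import fnmatchcase

names = ["file1.txt", "file2.txt", "file3.txt", "file4.txt", "file12.txt"]

# One bracket expression matches exactly one character from the set.
bracket_matches = [name for name in names if fnmatchcase(name, "file[123].txt")]
print(bracket_matches)  # ['file1.txt', 'file2.txt', 'file3.txt']

# fnmatch also understands ranges and negation inside brackets.
in_range = fnmatchcase("file2.txt", "file[1-3].txt")      # True
not_in_set = fnmatchcase("file4.txt", "file[!123].txt")   # True
```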
What is an example of a file glob?
Glob patterns can also include directory paths. For instance, to search in a subdirectory, you'd use `subdir/*.txt`. To search recursively, some systems support `**`, e.g. `**/*.txt` would find text files in all subdirectories.
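Python's `pathlib` is one of the systems that supports the recursive `**` form. The sketch below builds a hypothetical directory tree in a temporary folder and compares a single-directory pattern with a recursive one.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "subdir" / "nested").mkdir(parents=True)
    for rel in ("top.txt", "subdir/a.txt", "subdir/nested/b.txt", "subdir/c.csv"):
        (root / rel).touch()

    # "subdir/*.txt" looks only inside subdir itself.
    direct = sorted(p.relative_to(root).as_posix() for p in root.glob("subdir/*.txt"))

    # "**/*.txt" descends through every directory level, including the root.
    recursive = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

print(direct)     # ['subdir/a.txt']
print(recursive)  # ['subdir/a.txt', 'subdir/nested/b.txt', 'top.txt']
```

Recursive patterns can be slow on large directory trees, so it is usually worth anchoring them to the narrowest starting directory you can.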
In a directory with files named `data1.csv`, `data2.csv`, and `notes.txt`, the glob pattern `data*.csv` would match the first two files but not the third.
A practical example of using glob patterns is for Apify Actors. Actors are serverless cloud programs that run on the Apify platform and do computing jobs. They're called Actors because, like human actors, they perform actions based on a script. We use glob patterns to help our scrapers identify what web pages to crawl.
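To give a feel for URL filtering with globs, here is a generic sketch. It does not use the Apify SDK or reproduce how Apify's scrapers match URLs; the pattern, the URLs, and the variable names are all hypothetical, and Python's `fnmatch` is used purely as a stand-in matcher.

```python
from fnmatch import fnmatchcase

# Hypothetical pattern and URLs, for illustration only.
url_glob = "https://example.com/products/*"
discovered_urls = [
    "https://example.com/products/123",
    "https://example.com/products/blue-shirt",
    "https://example.com/about",
    "https://example.com/blog/products/",
]

# Keep only the URLs that fit the pattern.
to_crawl = [url for url in discovered_urls if fnmatchcase(url, url_glob)]
print(to_crawl)
# ['https://example.com/products/123', 'https://example.com/products/blue-shirt']
```

One caveat: in `fnmatch`, `*` also matches `/`, whereas many glob implementations stop `*` at path separators and reserve `**` for crossing them. Check the matcher's documentation before relying on either behavior.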
Glob vs. regex
When to choose globbing over regex
Globbing is best suited for simple file and directory-matching tasks. For example, if you're trying to find all .txt files in a directory, a simple glob pattern like `*.txt` will do the job.
Example: Simple file search
Suppose you have a directory containing the following files:
- document1.txt
- document2.txt
- image1.jpg
- image2.png
The glob pattern `*.txt` would effortlessly match `document1.txt` and `document2.txt`, without including the image files.
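The file list above can be filtered in one call with `fnmatch.filter` from Python's standard library. This is a sketch over an in-memory list of names rather than a real directory.

```python
import fnmatch

files = ["document1.txt", "document2.txt", "image1.jpg", "image2.png"]

# fnmatch.filter returns the names that match, in their original order.
text_files = fnmatch.filter(files, "*.txt")
print(text_files)  # ['document1.txt', 'document2.txt']
```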
When to choose regex over globbing
Regex comes into its own when you need more intricate pattern matching, not just with filenames but also within text files, data streams, and so on. For instance, if you're searching for email addresses in a document, a regex pattern like [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,} would be more appropriate.
Example: Email extraction
Consider a text file that contains the following:
- john.doe@example.com
- mary-smith123@workplace.co
- invalid-email@.com
The regex pattern mentioned above would match john.doe@example.com and mary-smith123@workplace.co but would correctly ignore invalid-email@.com.
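The email example can be reproduced with Python's `re` module. The pattern is the one quoted above; the variable names are just for this sketch. Keep in mind that this pattern is a simple illustration and not a complete email validator.

```python
import re

# The pattern from the text, compiled once for reuse.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

text = """\
- john.doe@example.com
- mary-smith123@workplace.co
- invalid-email@.com
"""

# findall scans the whole text and returns every non-overlapping match.
emails = EMAIL_PATTERN.findall(text)
print(emails)  # ['john.doe@example.com', 'mary-smith123@workplace.co']
```

The third address is skipped because the pattern needs at least one character between `@` and the final dot-plus-letters suffix, and `.com` alone cannot supply both.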
Choose globbing for simple, straightforward file and directory-matching tasks. It's easier to write and understand. On the other hand, opt for regex when you need more complex and versatile pattern-matching capabilities. While regex is more difficult to master, it offers greater flexibility for a wide range of tasks beyond just file matching.
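One way to see the relationship between the two is that a glob can be rewritten as a regex, but not every regex can be written as a glob. Python's `fnmatch.translate` performs that rewrite. The sketch below reuses the file names from the earlier example and assumes nothing beyond the standard library.

```python
import fnmatch
import re

glob_pattern = "*.txt"

# Convert the glob into regex source. The exact text returned varies
# between Python versions, so it is not printed here.
regex_source = fnmatch.translate(glob_pattern)
compiled = re.compile(regex_source)

files = ["document1.txt", "document2.txt", "image1.jpg", "image2.png"]

via_regex = [name for name in files if compiled.match(name)]
via_glob = fnmatch.filter(files, glob_pattern)

print(via_regex == via_glob)  # True
```

Both routes select the same files. The glob is simply the shorter way to say it.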
While both globs and regex allow for pattern matching, they serve different purposes:
Method | Description |
---|---|
Globbing | Primarily for filename matching. It's simpler and more intuitive but less powerful. |
Regex | A versatile tool for string pattern matching within files, data streams, and more. It's complex but highly flexible. |
Choose globbing for simple file searches and regex for intricate text pattern matching.