Can anyone help me think of a way to dynamically generate the patterns section of a tool I am creating? I’m not sure how to store and generate these “patterns” dynamically.
The program takes a big list of links (100,000), puts them in a database, groups them by domain, and then curls a page from each domain looking for backlinks.
Here is the database schema. You can see that the Domains table is where most of the information is stored, because we group the URLs together by domain: http://sqlfiddle.com/#!2/9a4d7
So now we know that some domains are live (and have a backlink) and some are dead (no backlink). This is relatively easy but now for the fun part.
I need to derive “patterns” from the live links. For example, find all live-link domains that have more than 25 links from that domain. So if joesblog.blogspot has 33 pages that link to my domain, it matches this pattern. Here is my list of patterns:
- Domains that include a homepage link
- Domains grouped by top level domain (.com, .org etc)
- Domains that returned a 405 header response
- URLs with matching directory structures
- Domains that contain the word _ _ _.
- URLs that contain the word _ _ _ in their path.
- Common anchor text.
- Common title tags.
- Common backlink targets (what page of your site does the link point to).
The problem is that the patterns are changing CONSTANTLY. They’re being moved around, added, edited, removed and anything else you can think of. I really need to build a content management system of sorts to handle these patterns. But how would I store something this intricate in a database?
Has anyone ever dealt with a similar problem, and how did you solve it?
If I could just store whole functions and MySQL statements in the database that would be great (but horribly wrong).
(PHP, MySQL, JavaScript / JQuery)
DISCLAIMER: This is an internal tool. Please don’t ask me why I’m building this or claim that the requirements are wrong. This was designed by my manager and it’s my task to make it work because I am a developer at the company that needs this tool. Thank you!
It seems like there are a few types of patterns. I can see some that are simple text-matching patterns (such as “domain name contains $word”), some which are based on interaction with the host (Domains that return a 405), TLD-based patterns….
I agree that storing executable code in the database is probably not the best approach. You might first need to find ways to categorize the different patterns into a smaller number of general templates. Then design a database schema to store the pattern metadata (which template a pattern uses and with which parameters), and a UI to let users work with the patterns. You’d also have to write code for each type of pattern that reads the pattern metadata from the pattern schema and applies it to the data in your main schema.
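A minimal sketch of that idea, assuming the pattern metadata is stored as a type name plus JSON-encoded parameters. The type names (`domain_contains`, `tld_equals`, `link_count_over`), the `params` layout, and the `link_count` column are all hypothetical and not taken from the linked schema; the arrays stand in for rows that would really come from the database.

```php
<?php
// Hypothetical rows, as they might come back from a `patterns` metadata table.
$patterns = [
    ['id' => 1, 'type' => 'domain_contains', 'params' => '{"word":"blog"}'],
    ['id' => 2, 'type' => 'tld_equals',      'params' => '{"tld":"org"}'],
    ['id' => 3, 'type' => 'link_count_over', 'params' => '{"threshold":25}'],
];

// Hypothetical rows for live domains (assumed columns: domain, link_count).
$domains = [
    ['domain' => 'joesblog.blogspot.com', 'link_count' => 33],
    ['domain' => 'example.org',           'link_count' => 4],
];

// One handler per pattern *type*. Users edit the metadata, never this code.
$handlers = [
    'domain_contains' => function (array $domain, array $params): bool {
        return stripos($domain['domain'], $params['word']) !== false;
    },
    'tld_equals' => function (array $domain, array $params): bool {
        $parts = explode('.', $domain['domain']);
        return strtolower(end($parts)) === strtolower($params['tld']);
    },
    'link_count_over' => function (array $domain, array $params): bool {
        return $domain['link_count'] > $params['threshold'];
    },
];

// Applies one stored pattern to the domain rows and returns the matching domain names.
function runPattern(array $pattern, array $domains, array $handlers): array
{
    if (!isset($handlers[$pattern['type']])) {
        throw new InvalidArgumentException('Unknown pattern type: ' . $pattern['type']);
    }
    $params  = json_decode($pattern['params'], true);
    $handler = $handlers[$pattern['type']];

    $matches = [];
    foreach ($domains as $domain) {
        if ($handler($domain, $params)) {
            $matches[] = $domain['domain'];
        }
    }
    return $matches;
}

$frequentLinkers = runPattern($patterns[2], $domains, $handlers);
```

With this split, adding or editing a pattern of an existing type is only a metadata change; new code is needed only when a genuinely new type of pattern appears.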
You could also do this entirely in code instead of the database, by having different class structures to represent different types of patterns and reading the configuration for each pattern from an XML file. The drawback is that pattern config is done in XML, which is probably slightly less user-friendly than a web-based CRUD interface. It also means that when a user wants to change or create a pattern, the XML config files need to be deployed (because of course they won’t have access to the production server, right? 😉 ), unless you build a tool to deploy changes automatically. If it’s all done in a database, there is no need to redeploy the changed config; it’s available immediately.
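As a rough illustration of the XML variant, the sketch below reads pattern definitions with PHP’s SimpleXML extension. The element and attribute names (`patterns`, `pattern`, `param`, `type`, `label`) are invented for this example, and the XML is inlined as a string only to keep the snippet self-contained; in practice it would be read from the deployed config file.

```php
<?php
// Hypothetical config format; none of these element names come from the question.
$xml = <<<XML
<patterns>
  <pattern type="domain_contains" label="Blog domains">
    <param name="word">blog</param>
  </pattern>
  <pattern type="link_count_over" label="More than 25 links">
    <param name="threshold">25</param>
  </pattern>
</patterns>
XML;

// Turns the XML config into plain arrays that the pattern classes can consume.
function loadPatternConfig(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        throw new RuntimeException('Pattern config is not valid XML');
    }

    $patterns = [];
    foreach ($doc->pattern as $node) {
        $params = [];
        foreach ($node->param as $param) {
            // SimpleXML values are objects; cast to string to get the text.
            $params[(string) $param['name']] = (string) $param;
        }
        $patterns[] = [
            'type'   => (string) $node['type'],
            'label'  => (string) $node['label'],
            'params' => $params,
        ];
    }
    return $patterns;
}

$config = loadPatternConfig($xml);
```

Note that every parameter arrives as a string here, so numeric values such as the threshold would need to be cast by whichever class consumes them.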
“But how would I store something this intricate in a database?”
You don’t. Each pattern is a combination of program code and SQL queries that produce a certain result.
In PHP, you can write an abstract class that serves as the common model for a set of pattern classes. Its constructor takes a string that gives a short description of the pattern, and it declares an execute() method that returns the result as an array.
You write a concrete pattern class for each of the patterns you know about, extending the abstract class. That way, all of your pattern classes have the same methods and results.
You write one more class, a loop class, that holds all of these pattern classes and, given an array of short-description strings, executes the patterns those strings name.
Finally, you set up a main class where you can specify the array of short description Strings. The strings can be read from a file, so you don’t have to modify the main class.
This way, when someone comes up with a new pattern, you create a new pattern class, add the pattern class to the loop class, and you’re done.
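The structure described above could look roughly like the following sketch (PHP 7.1 or later). All class names are made up for illustration, and execute() receives in-memory domain rows with assumed `domain` and `link_count` keys purely so the example runs on its own; in the real tool each pattern would run its own SQL query instead.

```php
<?php
// Base class: every pattern has a short description and an execute() method.
abstract class Pattern
{
    /** @var string */
    private $description;

    public function __construct(string $description)
    {
        $this->description = $description;
    }

    public function getDescription(): string
    {
        return $this->description;
    }

    // Returns the matching domain names as an array.
    abstract public function execute(array $domains): array;
}

// Concrete pattern: domains with more than $threshold links.
class FrequentLinkerPattern extends Pattern
{
    /** @var int */
    private $threshold;

    public function __construct(int $threshold = 25)
    {
        parent::__construct('frequent-linker');
        $this->threshold = $threshold;
    }

    public function execute(array $domains): array
    {
        $matches = [];
        foreach ($domains as $row) {
            if ($row['link_count'] > $this->threshold) {
                $matches[] = $row['domain'];
            }
        }
        return $matches;
    }
}

// The "loop class": knows every pattern and runs the ones asked for by name.
class PatternRunner
{
    /** @var Pattern[] keyed by short description */
    private $patterns = [];

    public function register(Pattern $pattern): void
    {
        $this->patterns[$pattern->getDescription()] = $pattern;
    }

    public function run(array $descriptions, array $domains): array
    {
        $results = [];
        foreach ($descriptions as $name) {
            if (isset($this->patterns[$name])) {
                $results[$name] = $this->patterns[$name]->execute($domains);
            }
        }
        return $results;
    }
}

// "Main": the list of descriptions could just as well be read from a file.
$domains = [
    ['domain' => 'joesblog.blogspot.com', 'link_count' => 33],
    ['domain' => 'example.org',           'link_count' => 4],
];
$runner = new PatternRunner();
$runner->register(new FrequentLinkerPattern(25));
$results = $runner->run(['frequent-linker'], $domains);
```

Adding a new pattern then means writing one more subclass and one more register() call; the runner and the main script stay unchanged.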
It sounds like you want a graph database. They basically track nodes and their relationships separately, and most of your queries could be done with their native query languages, if you design the database correctly.
For example, your nodes could be pages, and your relationships could be links to, has TLD, responds with, etc. Graph databases can follow relationships equally well in both directions, so finding all the links on a page is just as easy as finding all the links to a page.